Training a CNN model with the CIFAR-10 dataset in ML Engine

The trainer package source is inside the cifar10 directory. It is based on TensorFlow's CNN tutorial and one of the Datalab image classification examples.

Enable the ML Engine API

We need to enable the ML Engine API because it isn't enabled by default.

  1. Head back to the web console.
  2. Search for "API Manager" using the search bar at the top of the page.
  3. Select Library from the sidebar.
  4. Search for "ML Engine" and select Google Cloud Machine Learning Engine.
  5. Click ENABLE.

Build the trainer package


In [ ]:
%%bash
cd cifar10

# Clean old builds
rm -rf build dist

# Build wheel distribution
python setup.py bdist_wheel --universal

# Check the built package
ls -al dist

Submit the training job to ML Engine


In [ ]:
%%bash
cd cifar10

# Set some variables
JOB_NAME=cifar10_train_$(date +%s)
BUCKET_NAME=dost_deeplearning_cifar10 # Change this to your own!
TRAINING_PACKAGE_PATH=dist/trainer-0.0.0-py2.py3-none-any.whl

# Submit the job through the gcloud tool
gcloud ml-engine jobs submit training \
  $JOB_NAME \
  --region us-east1 \
  --job-dir gs://$BUCKET_NAME/$JOB_NAME \
  --packages $TRAINING_PACKAGE_PATH \
  --module-name trainer.task \
  --config config.yaml

It will take a few minutes for ML Engine to provision a training instance for our job. While that's happening, let's talk about pricing!

TensorBoard


In [ ]:
import os.path
from google.datalab.ml import TensorBoard

bucket_path = 'gs://dost_deeplearning_cifar10'  # Change this to your own bucket
job_name = 'cifar10_train_1499874404'           # Change this to your own job name
train_dir = os.path.join(bucket_path, job_name, 'train')

TensorBoard.start(train_dir)
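
The cell above builds the `gs://` path with `os.path.join`, which works in Datalab because it runs on Linux. As a small sketch, assuming you might run the same code on a machine where `os.path` does not use forward slashes, `posixpath.join` makes the separator explicit. The helper name `gcs_join` is hypothetical and not part of the trainer package or Datalab.

```python
import posixpath


def gcs_join(bucket_path, *parts):
    """Join path components under a gs:// URL, always using forward slashes.

    Hypothetical helper for illustration only.
    """
    if not bucket_path.startswith("gs://"):
        raise ValueError("expected a gs:// URL, got %r" % bucket_path)
    return posixpath.join(bucket_path, *parts)


# Same bucket and job name as in the cell above; change them to your own.
train_dir = gcs_join("gs://dost_deeplearning_cifar10",
                     "cifar10_train_1499874404",
                     "train")
print(train_dir)
```

The resulting `train_dir` can be passed to `TensorBoard.start` exactly as before.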

Now what?

Training will finish in around 8-9 hours. Make sure your training job is running properly before you leave!
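
One way to check on the job from the notebook is to ask `gcloud` for its state. The following is a minimal sketch, assuming the `gcloud` tool is installed and authenticated in your environment, as it was for the submit step. The function names are hypothetical, and the job name in the commented usage line is the example one from the TensorBoard cell.

```python
import subprocess


def describe_state_command(job_name):
    """Build the gcloud command that prints only the job's state field."""
    return ["gcloud", "ml-engine", "jobs", "describe", job_name,
            "--format", "value(state)"]


def job_state(job_name):
    """Run gcloud and return the job state as a string (e.g. RUNNING)."""
    output = subprocess.check_output(describe_state_command(job_name))
    return output.decode("utf-8").strip()


# Hypothetical usage; replace with your own job name before uncommenting:
# print(job_state("cifar10_train_1499874404"))
```

A state of `QUEUED` or `PREPARING` means the job has not started training yet. `FAILED` means you should look at the job's logs in the web console.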

We will deploy our trained model tomorrow and integrate it with a web app to run predictions on arbitrary images.

🌟 Challenge

Ideally, you'd want to evaluate your model every X steps while training to get a log of your accuracy values.

There's an eval.py module in the trainer package that's a slightly modified copy of cifar10_eval.py from the TensorFlow CIFAR-10 tutorial. We're not using it yet, though. Try adding this evaluation step and re-running your training job. Don't stop our previous training job!

TIP: You can add the evaluation step as a hook in our MonitoredTrainingSession. Take a look at _LoggerHook for an example.
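
The "every X steps" bookkeeping can be written independently of TensorFlow. Here is a minimal sketch of that part only. The class name `EveryNSteps` is hypothetical and is not something the trainer package provides. It is an assumption that your hook would call it from its `after_run` method with the current global step, and running the actual evaluation is left to you.

```python
class EveryNSteps(object):
    """Decides whether a periodic action (such as evaluation) is due.

    Hypothetical helper: compares the current step with the step at which
    the action last ran, so it still works if some step values are skipped.
    """

    def __init__(self, every_n_steps):
        if every_n_steps <= 0:
            raise ValueError("every_n_steps must be positive")
        self._every_n_steps = every_n_steps
        self._last_triggered_step = None

    def should_trigger(self, step):
        """Return True on the first call, then once every_n_steps have passed."""
        if self._last_triggered_step is None:
            return True
        return step - self._last_triggered_step >= self._every_n_steps

    def mark_triggered(self, step):
        """Record that the action ran at this step."""
        self._last_triggered_step = step


# Simulate a training loop that reports its step every 250 steps.
timer = EveryNSteps(1000)
eval_steps = []
for step in range(0, 3001, 250):
    if timer.should_trigger(step):
        eval_steps.append(step)  # a real hook would run the evaluation here
        timer.mark_triggered(step)

print(eval_steps)  # [0, 1000, 2000, 3000]
```

Tracking the last triggered step, rather than testing `step % N == 0`, avoids missing an evaluation when the hook does not see every single step value.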